I am going to work on GapMinder Education data. Solely education data doesn't give the much needed insights hence I am going to look at Economy and Employability Indicators as well as get insights on the relationship among each other.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
%matplotlib inline
The data has the gender parity index for gross enrollment in primary and secondary education. This is the ratio of girls to boys enrolled at primary and secondary grades of public and private schools.
Source : World Bank
No of countries : 204
df_gend_ratio_pr_sec_enrollment = pd.read_csv("./Data/ratio_of_girls_to_boys_in_primary_and_secondary_education_perc.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_enrollment.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_enrollment.shape[1])
df_gend_ratio_pr_sec_enrollment.head()
Total No of rows : 204 Total No of columns : 53
| country | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | 0.167 | 0.161 | 0.161 | 0.169 | 0.167 | 0.174 | 0.181 | 0.192 | 0.199 | ... | 0.669 | 0.655 | 0.654 | 0.642 | 0.646 | 0.642 | 0.636 | NaN | NaN | NaN |
| 2 | Angola | NaN | 0.640 | 0.657 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | 0.923 | NaN | 0.925 | ... | 0.990 | 0.982 | 0.977 | 0.982 | 0.994 | 1.000 | 1.020 | 1.02 | 1.02 | NaN |
| 4 | Andorra | NaN | NaN | NaN | NaN | NaN | 1.150 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 53 columns
The data consists of percentage of ratio of years spent by females to males in primary,secondary and tertiary education. It is collected for people of ages in the range of 25 to 34.
Source : Institute for Health Metrics and Evaluation (IHME), University of Washington
No of countries : 188
df_gend_ratio_pr_sec_ter_yrs = pd.read_csv("./Data/mean_years_in_school_women_percent_men_25_to_34_years.csv")
print("Total No of rows : ", df_gend_ratio_pr_sec_ter_yrs.shape[0])
print("Total No of columns : ", df_gend_ratio_pr_sec_ter_yrs.shape[1])
df_gend_ratio_pr_sec_ter_yrs.head()
Total No of rows : 188 Total No of columns : 47
| country | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 15.4 | 15.8 | 15.4 | 15.6 | 15.9 | 16.1 | 16.4 | 16.6 | 16.2 | ... | 21.5 | 21.9 | 22.2 | 22.3 | 22.6 | 22.9 | 23.1 | 23.4 | 23.5 | 23.7 |
| 1 | Angola | 51.3 | 51.4 | 51.9 | 52.3 | 52.8 | 53.2 | 53.4 | 53.8 | 54.3 | ... | 68.5 | 68.9 | 69.5 | 70.1 | 70.5 | 71.2 | 71.7 | 72.2 | 72.9 | 73.3 |
| 2 | Albania | 87.4 | 87.9 | 88.3 | 88.9 | 89.2 | 89.7 | 90.2 | 90.6 | 91.0 | ... | 100.0 | 101.0 | 101.0 | 101.0 | 101.0 | 102.0 | 102.0 | 102.0 | 102.0 | 103.0 |
| 3 | Andorra | 97.0 | 97.4 | 97.8 | 98.1 | 98.4 | 98.8 | 99.1 | 99.5 | 99.8 | ... | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 106.0 | 106.0 | 106.0 | 106.0 |
| 4 | United Arab Emirates | 90.9 | 91.4 | 92.0 | 92.4 | 93.0 | 93.6 | 94.0 | 94.5 | 95.2 | ... | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 105.0 | 106.0 | 106.0 |
5 rows × 47 columns
The data consists the percentage of all 15+ aged people who were employed that year.
Source : International Labour Organization
No of countries : 189
df_emp = pd.read_csv("./Data/aged_15plus_employment_rate_percent.csv")
print("Total No of rows : ", df_emp.shape[0])
print("Total No of columns : ", df_emp.shape[1])
df_emp.head()
Total No of rows : 189 Total No of columns : 31
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 42.5 | 42.5 | 42.5 | 42.5 | 42.4 | 42.4 | 42.3 | 42.2 | 42.2 | ... | 42.3 | 42.4 | 42.5 | 42.7 | 42.9 | 43.0 | 43.2 | 43.4 | 43.5 | 41.5 |
| 1 | Angola | 75.0 | 75.0 | 75.2 | 75.1 | 74.9 | 74.9 | 74.8 | 74.7 | 74.6 | ... | 71.7 | 71.8 | 71.8 | 71.9 | 71.9 | 72.0 | 72.1 | 72.1 | 72.1 | 69.6 |
| 2 | Albania | 57.8 | 58.2 | 56.8 | 55.7 | 54.1 | 53.3 | 54.5 | 53.8 | 52.7 | ... | 52.0 | 49.4 | 44.7 | 43.7 | 46.0 | 47.9 | 49.3 | 52.0 | 53.4 | 52.7 |
| 3 | United Arab Emirates | 71.8 | 72.2 | 72.9 | 73.4 | 73.8 | 73.3 | 73.1 | 73.3 | 73.7 | ... | 81.7 | 81.5 | 81.3 | 81.3 | 81.6 | 81.2 | 80.3 | 80.3 | 80.2 | 76.9 |
| 4 | Argentina | 57.3 | 56.9 | 54.9 | 54.0 | 49.5 | 50.7 | 52.5 | 54.1 | 53.1 | ... | 56.3 | 56.1 | 56.0 | 55.4 | 55.5 | 55.5 | 55.5 | 55.7 | 55.5 | 49.4 |
5 rows × 31 columns
The data consists of GDP per per capita which is calculated by the total amount (international dollars, fixed to 2017 prices) divided by the total population of the country.The data is adjusted for inflation and differences in the cost of living between countries, known as PPP dollars.
Source : Gapminder based on World Bank
No ofcountries : 210
df_income_per_person = pd.read_csv("./Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
print("Total No of rows : ", df_income_per_person.shape[0])
print("Total No of columns : ", df_income_per_person.shape[1])
df_income_per_person.head()
Total No of rows : 195 Total No of columns : 252
| country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | ... | 2690 | 2750 | 2810 | 2870 | 2930 | 2990 | 3060 | 3120 | 3190 | 3260 |
| 1 | Angola | 700 | 702 | 705 | 709 | 711 | 714 | 718 | 721 | 725 | ... | 8000 | 8170 | 8350 | 8530 | 8710 | 8900 | 9090 | 9280 | 9480 | 9690 |
| 2 | Albania | 755 | 755 | 755 | 755 | 755 | 756 | 756 | 756 | 756 | ... | 25.1k | 25.6k | 26.2k | 26.7k | 27.3k | 27.9k | 28.5k | 29.1k | 29.7k | 30.4k |
| 3 | Andorra | 1360 | 1360 | 1360 | 1360 | 1370 | 1370 | 1370 | 1370 | 1380 | ... | 68.9k | 70.4k | 71.9k | 73.4k | 75k | 76.6k | 78.3k | 80k | 81.7k | 83.4k |
| 4 | United Arab Emirates | 1130 | 1130 | 1140 | 1140 | 1150 | 1150 | 1160 | 1160 | 1160 | ... | 101k | 103k | 105k | 107k | 110k | 112k | 114k | 117k | 119k | 122k |
5 rows × 252 columns
Firstly lets look at the datatypes of all dataframes and convert them to proper datatypes
print(df_gend_ratio_pr_sec_enrollment.info())
df_gend_ratio_pr_sec_enrollment.describe()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 204 entries, 0 to 203 Data columns (total 53 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 204 non-null object 1 1970 34 non-null float64 2 1971 116 non-null float64 3 1972 111 non-null float64 4 1973 108 non-null float64 5 1974 102 non-null float64 6 1975 98 non-null float64 7 1976 105 non-null float64 8 1977 107 non-null float64 9 1978 101 non-null float64 10 1979 99 non-null float64 11 1980 95 non-null float64 12 1981 105 non-null float64 13 1982 99 non-null float64 14 1983 100 non-null float64 15 1984 105 non-null float64 16 1985 101 non-null float64 17 1986 109 non-null float64 18 1987 101 non-null float64 19 1988 101 non-null float64 20 1989 100 non-null float64 21 1990 99 non-null float64 22 1991 100 non-null float64 23 1992 98 non-null float64 24 1993 108 non-null float64 25 1994 100 non-null float64 26 1995 93 non-null float64 27 1996 88 non-null float64 28 1997 70 non-null float64 29 1998 86 non-null float64 30 1999 142 non-null float64 31 2000 136 non-null float64 32 2001 134 non-null float64 33 2002 134 non-null float64 34 2003 130 non-null float64 35 2004 143 non-null float64 36 2005 147 non-null float64 37 2006 138 non-null float64 38 2007 141 non-null float64 39 2008 137 non-null float64 40 2009 135 non-null float64 41 2010 135 non-null float64 42 2011 141 non-null float64 43 2012 136 non-null float64 44 2013 123 non-null float64 45 2014 132 non-null float64 46 2015 133 non-null float64 47 2016 129 non-null float64 48 2017 127 non-null float64 49 2018 123 non-null float64 50 2019 115 non-null float64 51 2020 56 non-null float64 52 2021 3 non-null float64 dtypes: float64(52), object(1) memory usage: 84.6+ KB None
| 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34.000000 | 116.000000 | 111.000000 | 108.000000 | 102.000000 | 98.000000 | 105.000000 | 107.000000 | 101.000000 | 99.000000 | ... | 136.000000 | 123.000000 | 132.000000 | 133.000000 | 129.000000 | 127.000000 | 123.000000 | 115.000000 | 56.000000 | 3.000000 |
| mean | 0.846285 | 0.774966 | 0.822279 | 0.837435 | 0.833902 | 0.863673 | 0.852486 | 0.858607 | 0.862950 | 0.876051 | ... | 0.981993 | 0.990081 | 0.986591 | 0.985662 | 0.988039 | 0.990929 | 0.993260 | 1.002670 | 1.010268 | 1.024333 |
| std | 0.268854 | 0.236902 | 0.241692 | 0.231280 | 0.230254 | 0.222574 | 0.216638 | 0.211052 | 0.203725 | 0.196936 | ... | 0.069261 | 0.066218 | 0.069373 | 0.070197 | 0.065811 | 0.060272 | 0.055941 | 0.043322 | 0.043691 | 0.045567 |
| min | 0.052700 | 0.161000 | 0.150000 | 0.169000 | 0.167000 | 0.174000 | 0.181000 | 0.192000 | 0.199000 | 0.436000 | ... | 0.669000 | 0.655000 | 0.654000 | 0.642000 | 0.646000 | 0.642000 | 0.636000 | 0.728000 | 0.890000 | 0.973000 |
| 25% | 0.785000 | 0.579000 | 0.624500 | 0.655000 | 0.657000 | 0.673500 | 0.667000 | 0.693000 | 0.707000 | 0.713000 | ... | 0.974500 | 0.977000 | 0.979750 | 0.979000 | 0.982000 | 0.983000 | 0.985500 | 0.988000 | 0.989750 | 1.006500 |
| 50% | 0.924000 | 0.857000 | 0.920000 | 0.937000 | 0.926000 | 0.939000 | 0.932000 | 0.936000 | 0.929000 | 0.949000 | ... | 0.995500 | 0.998000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.010000 | 1.010000 | 1.040000 |
| 75% | 1.007250 | 0.984250 | 0.994500 | 1.010000 | 1.000000 | 1.010000 | 1.010000 | 1.010000 | 1.010000 | 1.010000 | ... | 1.020000 | 1.020000 | 1.020000 | 1.020000 | 1.020000 | 1.015000 | 1.010000 | 1.020000 | 1.022500 | 1.050000 |
| max | 1.450000 | 1.110000 | 1.440000 | 1.420000 | 1.180000 | 1.410000 | 1.400000 | 1.420000 | 1.440000 | 1.430000 | ... | 1.130000 | 1.150000 | 1.150000 | 1.130000 | 1.100000 | 1.120000 | 1.120000 | 1.140000 | 1.150000 | 1.060000 |
8 rows × 52 columns
All year columns have gender ratio for primary to secondary schools enrollment data as floats for all countries. So, its good to go. Next, lets deal with NULL values
df_gend_ratio_pr_sec_enrollment.isna().sum()
country 0 1970 170 1971 88 1972 93 1973 96 1974 102 1975 106 1976 99 1977 97 1978 103 1979 105 1980 109 1981 99 1982 105 1983 104 1984 99 1985 103 1986 95 1987 103 1988 103 1989 104 1990 105 1991 104 1992 106 1993 96 1994 104 1995 111 1996 116 1997 134 1998 118 1999 62 2000 68 2001 70 2002 70 2003 74 2004 61 2005 57 2006 66 2007 63 2008 67 2009 69 2010 69 2011 63 2012 68 2013 81 2014 72 2015 71 2016 75 2017 77 2018 81 2019 89 2020 148 2021 201 dtype: int64
What we observe is, most of the recent years i.e. 2020 and old data i.e. before 1998 have most null values and shouldn't be considered while plotting as the analytics won't be correct. But for now, lets replace all rows with NULL values with the mean for each country and remove those countries with all NULL data.
#replace all countries with its individual means, I am doing it row-wise since the trend will match for each country and not year.
df_gend_ratio_pr_sec_enrollment.iloc[:,1:] = df_gend_ratio_pr_sec_enrollment.iloc[:,1:].apply(lambda row: row.fillna(row.mean()), axis=1)
df_gend_ratio_pr_sec_enrollment
# since old data doesn't give any useful picture, I am picking up 2000 to 2020 data.
year_list = list(map(str, range(2000, 2021)))
year_list.insert(0, 'country')
df_gend_ratio_pr_sec_enrollment = df_gend_ratio_pr_sec_enrollment[year_list]
df_gend_ratio_pr_sec_enrollment.head()
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | 0.993000 | 0.999000 | 0.989000 | 0.977000 | 0.953000 | 0.963000 | 0.968000 | 0.987000 | 1.010 | ... | 1.010 | 1.020000 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 |
| 1 | Afghanistan | 0.437389 | 0.000000 | 0.437389 | 0.544000 | 0.406000 | 0.548000 | 0.573000 | 0.567000 | 0.586 | ... | 0.663 | 0.669000 | 0.655000 | 0.654000 | 0.642000 | 0.646000 | 0.642000 | 0.636000 | 0.437389 | 0.437389 |
| 2 | Angola | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.833 | ... | 0.630 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 |
| 3 | Albania | 0.976000 | 0.987000 | 0.974195 | 0.974195 | 0.980000 | 0.978000 | 0.985000 | 0.997000 | 1.000 | ... | 1.000 | 0.990000 | 0.982000 | 0.977000 | 0.982000 | 0.994000 | 1.000000 | 1.020000 | 1.020000 | 1.020000 |
| 4 | Andorra | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090 | ... | 1.090 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 |
5 rows × 22 columns
null_count = df_gend_ratio_pr_sec_enrollment.isna().sum(axis = 1).to_frame()
print("Count of countries with all NULL values : ",len(null_count[null_count[0] == df_gend_ratio_pr_sec_enrollment.shape[1]-1]))
null_indexes = null_count[null_count[0] == df_gend_ratio_pr_sec_enrollment.shape[1]-1].index
if null_indexes.size > 0:
df_gend_ratio_pr_sec_enrollment.drop(null_indexes, inplace = True).reset_index(drop = True)
df_gend_ratio_pr_sec_enrollment.head()
Count of countries with all NULL values : 0
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | 0.993000 | 0.999000 | 0.989000 | 0.977000 | 0.953000 | 0.963000 | 0.968000 | 0.987000 | 1.010 | ... | 1.010 | 1.020000 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 | 0.992071 |
| 1 | Afghanistan | 0.437389 | 0.000000 | 0.437389 | 0.544000 | 0.406000 | 0.548000 | 0.573000 | 0.567000 | 0.586 | ... | 0.663 | 0.669000 | 0.655000 | 0.654000 | 0.642000 | 0.646000 | 0.642000 | 0.636000 | 0.437389 | 0.437389 |
| 2 | Angola | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.833 | ... | 0.630 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 |
| 3 | Albania | 0.976000 | 0.987000 | 0.974195 | 0.974195 | 0.980000 | 0.978000 | 0.985000 | 0.997000 | 1.000 | ... | 1.000 | 0.990000 | 0.982000 | 0.977000 | 0.982000 | 0.994000 | 1.000000 | 1.020000 | 1.020000 | 1.020000 |
| 4 | Andorra | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090 | ... | 1.090 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 |
5 rows × 22 columns
print(df_gend_ratio_pr_sec_ter_yrs.info())
df_gend_ratio_pr_sec_ter_yrs.describe()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 188 entries, 0 to 187 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 188 non-null object 1 1970 188 non-null float64 2 1971 188 non-null float64 3 1972 188 non-null float64 4 1973 188 non-null float64 5 1974 188 non-null float64 6 1975 188 non-null float64 7 1976 188 non-null float64 8 1977 188 non-null float64 9 1978 188 non-null float64 10 1979 188 non-null float64 11 1980 188 non-null float64 12 1981 188 non-null float64 13 1982 188 non-null float64 14 1983 188 non-null float64 15 1984 188 non-null float64 16 1985 188 non-null float64 17 1986 188 non-null float64 18 1987 188 non-null float64 19 1988 188 non-null float64 20 1989 188 non-null float64 21 1990 188 non-null float64 22 1991 188 non-null float64 23 1992 188 non-null float64 24 1993 188 non-null float64 25 1994 188 non-null float64 26 1995 188 non-null float64 27 1996 188 non-null float64 28 1997 188 non-null float64 29 1998 188 non-null float64 30 1999 188 non-null float64 31 2000 188 non-null float64 32 2001 188 non-null float64 33 2002 188 non-null float64 34 2003 188 non-null float64 35 2004 188 non-null float64 36 2005 188 non-null float64 37 2006 188 non-null float64 38 2007 188 non-null float64 39 2008 188 non-null float64 40 2009 188 non-null float64 41 2010 188 non-null float64 42 2011 188 non-null float64 43 2012 188 non-null float64 44 2013 188 non-null float64 45 2014 188 non-null float64 46 2015 188 non-null float64 dtypes: float64(46), object(1) memory usage: 69.2+ KB None
| 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.00000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | ... | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 | 188.000000 |
| mean | 74.872340 | 75.308511 | 75.735106 | 76.148404 | 76.589894 | 77.01383 | 77.428191 | 77.861702 | 78.284574 | 78.700000 | ... | 89.390426 | 89.669149 | 89.972340 | 90.304255 | 90.625532 | 90.894681 | 91.191489 | 91.460106 | 91.720213 | 91.962766 |
| std | 23.531585 | 23.535147 | 23.561424 | 23.535533 | 23.561574 | 23.56049 | 23.545932 | 23.534903 | 23.514179 | 23.467923 | ... | 21.011729 | 20.870095 | 20.741023 | 20.640666 | 20.540589 | 20.402434 | 20.289776 | 20.143014 | 20.022101 | 19.865456 |
| min | 11.200000 | 11.300000 | 11.400000 | 11.900000 | 12.000000 | 12.00000 | 12.400000 | 12.400000 | 12.800000 | 13.100000 | ... | 21.500000 | 21.900000 | 22.200000 | 22.300000 | 22.600000 | 22.900000 | 23.100000 | 23.400000 | 23.500000 | 23.700000 |
| 25% | 54.700000 | 55.325000 | 55.975000 | 56.500000 | 57.000000 | 57.67500 | 58.300000 | 58.850000 | 59.400000 | 59.950000 | ... | 78.450000 | 78.925000 | 79.475000 | 79.975000 | 80.525000 | 81.050000 | 81.475000 | 81.925000 | 82.425000 | 82.925000 |
| 50% | 85.550000 | 86.000000 | 86.350000 | 86.750000 | 87.150000 | 87.55000 | 87.800000 | 88.200000 | 88.650000 | 89.050000 | ... | 99.350000 | 99.600000 | 99.850000 | 100.000000 | 100.000000 | 100.500000 | 101.000000 | 101.000000 | 101.000000 | 101.000000 |
| 75% | 93.900000 | 94.300000 | 94.725000 | 95.100000 | 95.400000 | 95.90000 | 96.225000 | 96.600000 | 96.825000 | 97.300000 | ... | 104.000000 | 104.000000 | 104.000000 | 104.000000 | 104.250000 | 105.000000 | 105.000000 | 105.000000 | 105.000000 | 105.000000 |
| max | 129.000000 | 129.000000 | 129.000000 | 129.000000 | 129.000000 | 129.00000 | 130.000000 | 130.000000 | 129.000000 | 130.000000 | ... | 126.000000 | 127.000000 | 126.000000 | 126.000000 | 126.000000 | 126.000000 | 126.000000 | 126.000000 | 126.000000 | 126.000000 |
8 rows × 46 columns
All year columns have gender ratio for primary, seconday and tertiary school years attended data as floats for all countries. So, its good to go. Lets check for null values.
df_gend_ratio_pr_sec_ter_yrs.isna().sum()
country 0 1970 0 1971 0 1972 0 1973 0 1974 0 1975 0 1976 0 1977 0 1978 0 1979 0 1980 0 1981 0 1982 0 1983 0 1984 0 1985 0 1986 0 1987 0 1988 0 1989 0 1990 0 1991 0 1992 0 1993 0 1994 0 1995 0 1996 0 1997 0 1998 0 1999 0 2000 0 2001 0 2002 0 2003 0 2004 0 2005 0 2006 0 2007 0 2008 0 2009 0 2010 0 2011 0 2012 0 2013 0 2014 0 2015 0 dtype: int64
print(df_emp.info())
df_emp.describe()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 189 entries, 0 to 188 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 189 non-null object 1 1991 189 non-null float64 2 1992 189 non-null float64 3 1993 189 non-null float64 4 1994 189 non-null float64 5 1995 189 non-null float64 6 1996 189 non-null float64 7 1997 189 non-null float64 8 1998 189 non-null float64 9 1999 189 non-null float64 10 2000 189 non-null float64 11 2001 189 non-null float64 12 2002 189 non-null float64 13 2003 189 non-null float64 14 2004 189 non-null float64 15 2005 189 non-null float64 16 2006 189 non-null float64 17 2007 189 non-null float64 18 2008 189 non-null float64 19 2009 189 non-null float64 20 2010 189 non-null float64 21 2011 189 non-null float64 22 2012 189 non-null float64 23 2013 189 non-null float64 24 2014 189 non-null float64 25 2015 189 non-null float64 26 2016 189 non-null float64 27 2017 189 non-null float64 28 2018 189 non-null float64 29 2019 189 non-null float64 30 2020 189 non-null float64 dtypes: float64(30), object(1) memory usage: 45.9+ KB None
| 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | ... | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 | 189.000000 |
| mean | 58.230159 | 58.121164 | 57.801587 | 57.757143 | 57.679365 | 57.511111 | 57.517989 | 57.473016 | 57.360317 | 57.343386 | ... | 57.262434 | 57.297354 | 57.275132 | 57.374074 | 57.521164 | 57.571429 | 57.766138 | 57.940212 | 58.055556 | 55.587302 |
| std | 11.896678 | 11.896820 | 11.946950 | 11.912027 | 11.964471 | 11.964199 | 11.896070 | 11.874309 | 11.832428 | 11.838322 | ... | 11.916831 | 11.924699 | 11.847428 | 11.787706 | 11.686243 | 11.649396 | 11.585708 | 11.586904 | 11.467977 | 11.255951 |
| min | 33.000000 | 32.900000 | 33.100000 | 32.500000 | 30.700000 | 32.100000 | 33.200000 | 32.400000 | 31.400000 | 30.600000 | ... | 33.100000 | 32.900000 | 32.400000 | 31.900000 | 32.300000 | 32.600000 | 32.700000 | 32.100000 | 32.800000 | 30.900000 |
| 25% | 51.100000 | 50.400000 | 49.900000 | 50.100000 | 49.500000 | 49.500000 | 49.400000 | 49.600000 | 48.800000 | 49.100000 | ... | 50.000000 | 50.100000 | 49.700000 | 49.700000 | 49.800000 | 49.500000 | 50.100000 | 50.500000 | 50.400000 | 48.600000 |
| 50% | 57.200000 | 56.900000 | 56.800000 | 57.100000 | 56.500000 | 56.400000 | 56.500000 | 56.500000 | 56.300000 | 56.500000 | ... | 57.200000 | 57.400000 | 57.300000 | 57.200000 | 57.800000 | 57.600000 | 58.000000 | 58.400000 | 58.300000 | 55.600000 |
| 75% | 64.800000 | 64.100000 | 64.600000 | 64.700000 | 64.800000 | 64.700000 | 64.600000 | 64.500000 | 64.700000 | 64.600000 | ... | 64.200000 | 64.300000 | 63.800000 | 64.100000 | 64.300000 | 64.300000 | 64.400000 | 65.100000 | 65.100000 | 62.100000 |
| max | 90.500000 | 89.800000 | 88.700000 | 87.800000 | 86.500000 | 86.400000 | 86.100000 | 85.800000 | 85.400000 | 85.000000 | ... | 86.900000 | 87.800000 | 87.000000 | 87.000000 | 87.800000 | 87.200000 | 86.700000 | 86.600000 | 86.700000 | 83.300000 |
8 rows × 30 columns
All year columns have employement % for 15+ aged people as floats for all countries. So, its good to go. Let's check for NULL values now.
df_emp.isna().sum()
country 0 1991 0 1992 0 1993 0 1994 0 1995 0 1996 0 1997 0 1998 0 1999 0 2000 0 2001 0 2002 0 2003 0 2004 0 2005 0 2006 0 2007 0 2008 0 2009 0 2010 0 2011 0 2012 0 2013 0 2014 0 2015 0 2016 0 2017 0 2018 0 2019 0 2020 0 dtype: int64
print(df_income_per_person.info())
df_income_per_person.describe()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195 entries, 0 to 194 Columns: 252 entries, country to 2050 dtypes: int64(101), object(151) memory usage: 384.0+ KB None
| 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 1891 | 1892 | 1893 | 1894 | 1895 | 1896 | 1897 | 1898 | 1899 | 1900 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | ... | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 |
| mean | 1072.179487 | 1071.353846 | 1074.210256 | 1073.753846 | 1075.430769 | 1075.174359 | 1076.169231 | 1076.251282 | 1066.497436 | 1067.553846 | ... | 1796.825641 | 1819.466667 | 1839.246154 | 1873.189744 | 1886.882051 | 1917.348718 | 1924.005128 | 1957.410256 | 1981.030769 | 1987.743590 |
| std | 605.373041 | 602.816015 | 614.245612 | 608.330435 | 617.123397 | 610.442792 | 612.281097 | 606.053716 | 564.608763 | 569.926448 | ... | 1431.441055 | 1433.272527 | 1418.456847 | 1441.871804 | 1476.487543 | 1518.852409 | 1516.524093 | 1559.755626 | 1604.763852 | 1616.979912 |
| min | 224.000000 | 224.000000 | 224.000000 | 224.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | ... | 334.000000 | 337.000000 | 339.000000 | 342.000000 | 345.000000 | 348.000000 | 351.000000 | 354.000000 | 357.000000 | 360.000000 |
| 25% | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | ... | 883.000000 | 903.000000 | 932.500000 | 935.500000 | 939.000000 | 948.000000 | 948.500000 | 950.500000 | 955.000000 | 957.000000 |
| 50% | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 954.000000 | 954.000000 | 956.000000 | 960.000000 | 960.000000 | 960.000000 | ... | 1250.000000 | 1290.000000 | 1290.000000 | 1300.000000 | 1300.000000 | 1310.000000 | 1310.000000 | 1320.000000 | 1320.000000 | 1330.000000 |
| 75% | 1240.000000 | 1240.000000 | 1240.000000 | 1245.000000 | 1250.000000 | 1250.000000 | 1255.000000 | 1255.000000 | 1255.000000 | 1255.000000 | ... | 2185.000000 | 2220.000000 | 2250.000000 | 2305.000000 | 2340.000000 | 2350.000000 | 2335.000000 | 2430.000000 | 2455.000000 | 2410.000000 |
| max | 4860.000000 | 4780.000000 | 5040.000000 | 4930.000000 | 5170.000000 | 4860.000000 | 4900.000000 | 4490.000000 | 3780.000000 | 3850.000000 | ... | 8650.000000 | 8510.000000 | 7940.000000 | 7960.000000 | 8320.000000 | 8380.000000 | 8610.000000 | 8720.000000 | 9250.000000 | 9340.000000 |
8 rows × 101 columns
Income per person data has all expected float/ integer fields as objects (strings). This is due to the presence of K in the numbers denoting thousands. So a value of 6.6k for 2017 in Australia means its 66000.00.
Let's convert the numbers to their true values by replacing them with their integer values and again run the info and describe methods to ensure they are in correct datatypes.
for column in df_income_per_person.columns[1:]:
is_thousand = df_income_per_person[column].astype(str).str.contains('k', na = False)
df_income_per_person[column].replace("k$","", regex = True, inplace = True)
df_income_per_person[column] = pd.to_numeric(df_income_per_person[column])
df_income_per_person[column] = np.where(is_thousand, df_income_per_person[column] *1000, df_income_per_person[column])
print(df_income_per_person.info())
df_income_per_person.describe()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195 entries, 0 to 194 Columns: 252 entries, country to 2050 dtypes: float64(150), int64(101), object(1) memory usage: 384.0+ KB None
| 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | ... | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 |
| mean | 1072.179487 | 1071.353846 | 1074.210256 | 1073.753846 | 1075.430769 | 1075.174359 | 1076.169231 | 1076.251282 | 1066.497436 | 1067.553846 | ... | 31265.897436 | 31926.974359 | 32628.666667 | 33321.487179 | 34038.717949 | 34775.692308 | 35524.307692 | 36288.205128 | 37069.538462 | 37862.666667 |
| std | 605.373041 | 602.816015 | 614.245612 | 608.330435 | 617.123397 | 610.442792 | 612.281097 | 606.053716 | 564.608763 | 569.926448 | ... | 32907.191950 | 33561.834556 | 34317.547783 | 35048.008199 | 35799.737313 | 36575.963727 | 37360.004510 | 38162.975396 | 38986.791899 | 39818.346586 |
| min | 224.000000 | 224.000000 | 224.000000 | 224.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | 225.000000 | ... | 1050.000000 | 1070.000000 | 1100.000000 | 1120.000000 | 1140.000000 | 1170.000000 | 1190.000000 | 1220.000000 | 1240.000000 | 1270.000000 |
| 25% | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | 680.500000 | ... | 6550.000000 | 6690.000000 | 6840.000000 | 6985.000000 | 7135.000000 | 7290.000000 | 7445.000000 | 7610.000000 | 7770.000000 | 7935.000000 |
| 50% | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 954.000000 | 954.000000 | 956.000000 | 960.000000 | 960.000000 | 960.000000 | ... | 20000.000000 | 20400.000000 | 20900.000000 | 21300.000000 | 21800.000000 | 22200.000000 | 22700.000000 | 23200.000000 | 23700.000000 | 24200.000000 |
| 75% | 1240.000000 | 1240.000000 | 1240.000000 | 1245.000000 | 1250.000000 | 1250.000000 | 1255.000000 | 1255.000000 | 1255.000000 | 1255.000000 | ... | 49150.000000 | 50250.000000 | 51300.000000 | 52400.000000 | 53550.000000 | 54750.000000 | 55900.000000 | 57100.000000 | 58350.000000 | 59550.000000 |
| max | 4860.000000 | 4780.000000 | 5040.000000 | 4930.000000 | 5170.000000 | 4860.000000 | 4900.000000 | 4490.000000 | 3780.000000 | 3850.000000 | ... | 173000.000000 | 176000.000000 | 180000.000000 | 184000.000000 | 188000.000000 | 192000.000000 | 196000.000000 | 200000.000000 | 205000.000000 | 209000.000000 |
8 rows × 251 columns
The struture is now similar to other dataframes with country and years as columns. This is a little different due to the presence of multiindex but it can be used to select relavant dataeasily. All year columns have income per person data as floats for all countries after the transformations. So, its good to go. Lets check for NULL values now.
df_income_per_person.isna().sum()
country 0
1800 0
1801 0
1802 0
1803 0
..
2046 0
2047 0
2048 0
2049 0
2050 0
Length: 252, dtype: int64
#get sets of both countries and find the common among both using interection of sets
countries_gend_ratio_pr_sec_enrollment = set(df_gend_ratio_pr_sec_enrollment.country)
print(f"No of countries in enrollment dataset is {len(countries_gend_ratio_pr_sec_enrollment)}.")
countries_income_per_person = set(df_income_per_person.country)
print(f"No of countries in income dataset is {len(countries_income_per_person)}.")
common_country_list_rq1 = list(countries_income_per_person.intersection(countries_gend_ratio_pr_sec_enrollment))
print(f"No of common countries among enrollment and income datasets are {len(common_country_list_rq1)}.")
#filter both dataframes to keep only those rows corresponding to common countries
df_gend_ratio_enrollment_rq1 = df_gend_ratio_pr_sec_enrollment[df_gend_ratio_pr_sec_enrollment['country'].isin(common_country_list_rq1)].reset_index(drop = True)
df_income_per_person_rq1 = df_income_per_person[df_income_per_person['country'].isin(common_country_list_rq1)].reset_index(drop = True)
No of countries in enrollment dataset is 204. No of countries in income dataset is 195. No of common countries among enrollment and income datasets are 190.
years_gend_ratio_pr_sec_enrollment = df_gend_ratio_pr_sec_enrollment.columns[1:]
print(f"Range of years in enrollment dataset is {years_gend_ratio_pr_sec_enrollment[0]}-{years_gend_ratio_pr_sec_enrollment[-1]}.")
years_income_per_person = df_income_per_person.columns[1:]
print(f"Range of years in income dataset is {years_income_per_person[0]}-{years_income_per_person[-1]}.")
Range of years in enrollment dataset is 2000-2020. Range of years in income dataset is 1800-2050.
year_list_rq1 = list(map(str, range(2015, 2021)))
year_list_rq1.insert(0, 'country')
df_gend_ratio_enrollment_rq1 = df_gend_ratio_enrollment_rq1[year_list_rq1].sort_values("country").reset_index(drop= True)
df_income_per_person_rq1 = df_income_per_person_rq1[year_list_rq1].sort_values("country").reset_index(drop= True)
display(df_gend_ratio_enrollment_rq1.head())
display(df_income_per_person_rq1.head())
| country | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.642000 | 0.646000 | 0.642000 | 0.636000 | 0.437389 | 0.437389 |
| 1 | Albania | 0.982000 | 0.994000 | 1.000000 | 1.020000 | 1.020000 | 1.020000 |
| 2 | Algeria | 0.833923 | 0.833923 | 0.833923 | 0.833923 | 0.833923 | 0.833923 |
| 3 | Andorra | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 | 1.090000 |
| 4 | Angola | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 | 0.754889 |
| country | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2070.0 | 2060.0 | 2060.0 | 2030.0 | 2070.0 | 1970.0 |
| 1 | Albania | 11900.0 | 12300.0 | 12800.0 | 13300.0 | 13700.0 | 13200.0 |
| 2 | Algeria | 11700.0 | 11800.0 | 11700.0 | 11600.0 | 11500.0 | 10900.0 |
| 3 | Andorra | 52700.0 | 54500.0 | 56300.0 | 58300.0 | 58400.0 | 51600.0 |
| 4 | Angola | 8040.0 | 7570.0 | 7310.0 | 6930.0 | 6670.0 | 6120.0 |
For this we need to find the mean income for each country, sort them in ascending order and choose the top 5.
df_income_per_person_rq1['mean_income'] = df_income_per_person_rq1.iloc[:,1:].mean(axis= 1)
lowest_income_5 = df_income_per_person_rq1.sort_values('mean_income').head().sort_values("country")
lowest_income_5
| country | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | mean_income | |
|---|---|---|---|---|---|---|---|---|
| 26 | Burundi | 825.0 | 795.0 | 774.0 | 762.0 | 752.0 | 732.0 | 773.333333 |
| 31 | Central African Republic | 853.0 | 884.0 | 913.0 | 933.0 | 945.0 | 937.0 | 910.833333 |
| 37 | Congo, Dem. Rep. | 1070.0 | 1060.0 | 1060.0 | 1090.0 | 1100.0 | 1080.0 | 1076.666667 |
| 121 | Niger | 1130.0 | 1150.0 | 1160.0 | 1200.0 | 1220.0 | 1220.0 | 1180.000000 |
| 152 | Somalia | 1030.0 | 1080.0 | 1080.0 | 1130.0 | 1190.0 | 1150.0 | 1110.000000 |
fig = px.bar(df_income_per_person_rq1, x= "country", y = 'mean_income',
title='Income per person in all countries',
color = "country",
labels = {"mean_income": "Mean Income (International Dollars)", "country": "Country"},
color_discrete_sequence = px.colors.qualitative.Pastel
)
fig.add_trace(go.Scatter(x=lowest_income_5.country, y=lowest_income_5["mean_income"],
mode='markers',
name='Least Income', line=dict(color='firebrick')))
fig.show()
lowest_income_gend_ratio_enrollment = df_gend_ratio_enrollment_rq1[df_gend_ratio_enrollment_rq1['country'].isin(lowest_income_5.country)]
lowest_income_gend_ratio_enrollment = lowest_income_gend_ratio_enrollment.set_index('country').unstack().reset_index().rename(columns = {'level_0': 'year', 0: 'enrollmentRatio'}).sort_values(["country", "year"])[['country', 'year', 'enrollmentRatio']].reset_index(drop= True)
lowest_income_gend_ratio_enrollment
| country | year | enrollmentRatio | |
|---|---|---|---|
| 0 | Burundi | 2015 | 1.000000 |
| 1 | Burundi | 2016 | 1.010000 |
| 2 | Burundi | 2017 | 1.020000 |
| 3 | Burundi | 2018 | 1.030000 |
| 4 | Burundi | 2019 | 1.040000 |
| 5 | Burundi | 2020 | 0.777000 |
| 6 | Central African Republic | 2015 | 0.578955 |
| 7 | Central African Republic | 2016 | 0.761000 |
| 8 | Central African Republic | 2017 | 0.578955 |
| 9 | Central African Republic | 2018 | 0.578955 |
| 10 | Central African Republic | 2019 | 0.578955 |
| 11 | Central African Republic | 2020 | 0.578955 |
| 12 | Congo, Dem. Rep. | 2015 | 0.889000 |
| 13 | Congo, Dem. Rep. | 2016 | 0.685452 |
| 14 | Congo, Dem. Rep. | 2017 | 0.685452 |
| 15 | Congo, Dem. Rep. | 2018 | 0.685452 |
| 16 | Congo, Dem. Rep. | 2019 | 0.685452 |
| 17 | Congo, Dem. Rep. | 2020 | 0.685452 |
| 18 | Niger | 2015 | 0.825000 |
| 19 | Niger | 2016 | 0.831000 |
| 20 | Niger | 2017 | 0.838000 |
| 21 | Niger | 2018 | 0.639667 |
| 22 | Niger | 2019 | 0.639667 |
| 23 | Niger | 2020 | 0.639667 |
| 24 | Somalia | 2015 | 0.483563 |
| 25 | Somalia | 2016 | 0.483563 |
| 26 | Somalia | 2017 | 0.483563 |
| 27 | Somalia | 2018 | 0.483563 |
| 28 | Somalia | 2019 | 0.483563 |
| 29 | Somalia | 2020 | 0.483563 |
mean_gend_ratio_enrollment_per_year = df_gend_ratio_enrollment_rq1.iloc[:,1:].mean(axis= 0).to_frame().reset_index().rename(columns = {'index': 'year', 0: 'meanEnrollmentRatio'})
mean_gend_ratio_enrollment_per_year
| year | meanEnrollmentRatio | |
|---|---|---|
| 0 | 2015 | 0.954438 |
| 1 | 2016 | 0.954636 |
| 2 | 2017 | 0.955408 |
| 3 | 2018 | 0.950834 |
| 4 | 2019 | 0.946897 |
| 5 | 2020 | 0.941240 |
fig = px.line(lowest_income_gend_ratio_enrollment, x= "year", y = 'enrollmentRatio',
title='Gender ratio of students for least 5 gross income countries',
color = "country",
color_discrete_sequence= [px.colors.qualitative.Pastel[1],
px.colors.qualitative.Pastel[7],
px.colors.qualitative.Pastel[3],
px.colors.qualitative.Pastel[4],
px.colors.qualitative.Pastel[5]
],
markers=True, symbol="country",
labels={
"year": "Year",
"enrollmentRatio": "Girls to Boys Enrollment Ratio",
"country": "Country"
})
fig.add_trace(go.Scatter(x=mean_gend_ratio_enrollment_per_year.year, y=mean_gend_ratio_enrollment_per_year.meanEnrollmentRatio,
mode='lines+markers',
name='Mean Enrollment Gender Ratio', line=dict(color='firebrick')))
fig.show()
#get sets of both countries and find the common among both using interection of sets
countries_emp = set(df_emp.country)
print(f"No of countries in employement rate dataset is {len(countries_emp)}.")
countries_gend_ratio_pr_sec_ter_yrs = set(df_gend_ratio_pr_sec_ter_yrs.country)
print(f"No of countries in no of school years dataset is {len(countries_gend_ratio_pr_sec_ter_yrs)}.")
common_country_list_rq2 = list(countries_gend_ratio_pr_sec_ter_yrs.intersection(countries_emp))
print(f"No of common countries among enrollment and income datasets are {len(common_country_list_rq2)}.")
#filter both dataframes to keep only those rows corresponding to common countries
df_emp_rq2 = df_emp[df_emp['country'].isin(common_country_list_rq2)].reset_index(drop = True)
df_gend_ratio_pr_sec_ter_yrs_rq2 = df_gend_ratio_pr_sec_ter_yrs[df_gend_ratio_pr_sec_ter_yrs['country'].isin(common_country_list_rq2)].reset_index(drop = True)
No of countries in employement rate dataset is 189. No of countries in no of school years dataset is 188. No of common countries among enrollment and income datasets are 180.
years_emp = df_emp.columns[1:]
print(f"Range of years in employment rates dataset is {years_emp[0]}-{years_emp[-1]}.")
years_gend_ratio_pr_sec_ter_yrs = df_gend_ratio_pr_sec_ter_yrs.columns[1:]
print(f"Range of years in school years dataset is {years_gend_ratio_pr_sec_ter_yrs[0]}-{years_gend_ratio_pr_sec_ter_yrs[-1]}.")
Range of years in employment rates dataset is 1991-2020. Range of years in school years dataset is 1970-2015.
year_list_rq2 = list(map(str, range(1991, 2016)))
year_list_rq2.insert(0, 'country')
df_emp_rq2 = df_emp_rq2[year_list_rq2].sort_values("country").reset_index(drop= True)
df_gend_ratio_pr_sec_ter_yrs_rq2 = df_gend_ratio_pr_sec_ter_yrs_rq2[year_list_rq2].sort_values("country").reset_index(drop= True)
display(df_emp_rq2.head())
display(df_gend_ratio_pr_sec_ter_yrs_rq2.head())
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 42.5 | 42.5 | 42.5 | 42.5 | 42.4 | 42.4 | 42.3 | 42.2 | 42.2 | ... | 42.9 | 42.8 | 42.7 | 42.4 | 42.2 | 42.3 | 42.4 | 42.5 | 42.7 | 42.9 |
| 1 | Albania | 57.8 | 58.2 | 56.8 | 55.7 | 54.1 | 53.3 | 54.5 | 53.8 | 52.7 | ... | 47.4 | 46.9 | 47.9 | 47.1 | 46.6 | 52.0 | 49.4 | 44.7 | 43.7 | 46.0 |
| 2 | Algeria | 35.4 | 33.7 | 33.1 | 32.5 | 30.7 | 32.1 | 33.2 | 32.4 | 31.4 | ... | 37.2 | 36.5 | 37.4 | 37.8 | 38.1 | 38.3 | 38.0 | 39.4 | 37.3 | 37.2 |
| 3 | Angola | 75.0 | 75.0 | 75.2 | 75.1 | 74.9 | 74.9 | 74.8 | 74.7 | 74.6 | ... | 74.2 | 74.3 | 74.3 | 74.3 | 70.0 | 71.7 | 71.8 | 71.8 | 71.9 | 71.9 |
| 4 | Argentina | 57.3 | 56.9 | 54.9 | 54.0 | 49.5 | 50.7 | 52.5 | 54.1 | 53.1 | ... | 56.0 | 56.3 | 56.0 | 55.8 | 55.7 | 56.3 | 56.1 | 56.0 | 55.4 | 55.5 |
5 rows × 26 columns
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 18.5 | 18.9 | 18.9 | 19.2 | 19.1 | 19.4 | 19.7 | 19.9 | 20.1 | ... | 21.5 | 21.9 | 22.2 | 22.3 | 22.6 | 22.9 | 23.1 | 23.4 | 23.5 | 23.7 |
| 1 | Albania | 96.1 | 96.5 | 96.9 | 97.1 | 97.4 | 97.8 | 98.1 | 98.5 | 98.7 | ... | 100.0 | 101.0 | 101.0 | 101.0 | 101.0 | 102.0 | 102.0 | 102.0 | 102.0 | 103.0 |
| 2 | Algeria | 87.7 | 87.3 | 86.6 | 85.9 | 85.4 | 85.4 | 85.6 | 85.6 | 85.8 | ... | 87.6 | 88.0 | 88.3 | 88.7 | 89.0 | 89.4 | 89.7 | 90.0 | 90.5 | 90.8 |
| 3 | Angola | 60.4 | 60.8 | 61.2 | 61.8 | 62.4 | 62.8 | 63.5 | 63.9 | 64.5 | ... | 68.5 | 68.9 | 69.5 | 70.1 | 70.5 | 71.2 | 71.7 | 72.2 | 72.9 | 73.3 |
| 4 | Argentina | 104.0 | 104.0 | 105.0 | 105.0 | 105.0 | 105.0 | 106.0 | 106.0 | 106.0 | ... | 107.0 | 107.0 | 107.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 | 108.0 |
5 rows × 26 columns
To understand the relationship (positive correlation or negative) among gender ratio of school years and employement rate, and how it changed over these years, we need to find the correlation and next will be comparing it with the mean of gender ratio of number of school years. We will analyze if they both are in same correlation or different.
emp_gend_ratio_yrs_corr = df_emp_rq2.corrwith(df_gend_ratio_pr_sec_ter_yrs_rq2, axis = 0).to_frame().reset_index().rename(columns = {'index': 'year', 0: 'correlationValue'})
emp_gend_ratio_yrs_corr
| year | correlationValue | |
|---|---|---|
| 0 | 1991 | -0.211374 |
| 1 | 1992 | -0.215694 |
| 2 | 1993 | -0.226082 |
| 3 | 1994 | -0.229419 |
| 4 | 1995 | -0.226484 |
| 5 | 1996 | -0.236006 |
| 6 | 1997 | -0.230684 |
| 7 | 1998 | -0.225490 |
| 8 | 1999 | -0.223718 |
| 9 | 2000 | -0.216136 |
| 10 | 2001 | -0.209794 |
| 11 | 2002 | -0.209221 |
| 12 | 2003 | -0.205498 |
| 13 | 2004 | -0.194099 |
| 14 | 2005 | -0.175256 |
| 15 | 2006 | -0.154981 |
| 16 | 2007 | -0.133490 |
| 17 | 2008 | -0.120964 |
| 18 | 2009 | -0.140309 |
| 19 | 2010 | -0.140064 |
| 20 | 2011 | -0.128902 |
| 21 | 2012 | -0.112533 |
| 22 | 2013 | -0.106454 |
| 23 | 2014 | -0.089712 |
| 24 | 2015 | -0.081037 |
#finding mean of correlation for all years
emp_gend_ratio_yrs_corr['meanRatio'] = df_gend_ratio_pr_sec_ter_yrs_rq2.iloc[:,1:].mean(axis= 0).to_frame().reset_index(drop=True).rename(columns = { 0: 'meanSchoolYearsRatio'})
#standardizing the mean as per the scale of correlationValue for better visualization
emp_gend_ratio_yrs_corr['standardizedMeanRatio'] = ((emp_gend_ratio_yrs_corr['meanRatio'] - emp_gend_ratio_yrs_corr['meanRatio'].min()) * 0.15 )/ (emp_gend_ratio_yrs_corr['meanRatio'].max() - emp_gend_ratio_yrs_corr['meanRatio'].min()) - 0.24
emp_gend_ratio_yrs_corr
| year | correlationValue | meanRatio | standardizedMeanRatio | |
|---|---|---|---|---|
| 0 | 1991 | -0.211374 | 83.071111 | -0.240000 |
| 1 | 1992 | -0.215694 | 83.503889 | -0.232181 |
| 2 | 1993 | -0.226082 | 83.945000 | -0.224211 |
| 3 | 1994 | -0.229419 | 84.351111 | -0.216874 |
| 4 | 1995 | -0.226484 | 84.733889 | -0.209958 |
| 5 | 1996 | -0.236006 | 85.145000 | -0.202530 |
| 6 | 1997 | -0.230684 | 85.548333 | -0.195243 |
| 7 | 1998 | -0.225490 | 85.941111 | -0.188146 |
| 8 | 1999 | -0.223718 | 86.291667 | -0.181813 |
| 9 | 2000 | -0.216136 | 86.682222 | -0.174756 |
| 10 | 2001 | -0.209794 | 87.045000 | -0.168202 |
| 11 | 2002 | -0.209221 | 87.411667 | -0.161577 |
| 12 | 2003 | -0.205498 | 87.713889 | -0.156117 |
| 13 | 2004 | -0.194099 | 88.073889 | -0.149613 |
| 14 | 2005 | -0.175256 | 88.424444 | -0.143279 |
| 15 | 2006 | -0.154981 | 88.756111 | -0.137287 |
| 16 | 2007 | -0.133490 | 89.038889 | -0.132177 |
| 17 | 2008 | -0.120964 | 89.352222 | -0.126516 |
| 18 | 2009 | -0.140309 | 89.685000 | -0.120504 |
| 19 | 2010 | -0.140064 | 90.006111 | -0.114702 |
| 20 | 2011 | -0.128902 | 90.285000 | -0.109663 |
| 21 | 2012 | -0.112533 | 90.588333 | -0.104183 |
| 22 | 2013 | -0.106454 | 90.851111 | -0.099435 |
| 23 | 2014 | -0.089712 | 91.121111 | -0.094557 |
| 24 | 2015 | -0.081037 | 91.373333 | -0.090000 |
Let's plot the yearly trend to see how correlation changed accross 1991 to 2015.
fig = px.line(emp_gend_ratio_yrs_corr, x= "year", y = 'correlationValue',
title='Yearly trend of relationship between employement rate and school enrollment ratio',
labels = {"correlationValue": "Correlation Value", "country": "Country", "year": "Year"},
color_discrete_sequence = px.colors.qualitative.Pastel,
markers = True
)
fig.add_trace(go.Scatter(x=emp_gend_ratio_yrs_corr.year, y=emp_gend_ratio_yrs_corr.standardizedMeanRatio,
mode='lines+markers',
# name='Mean School years Ratio',
line=dict(color='firebrick')))
fig.layout.update(showlegend=False)
fig.show()
Next, lets observe the countrywise correlation between school years and employement rate and see where the impact has most positive correlation and where is it most negative. This helps us visualize where in our world improvement in ratio has a positive/ negative impact on employement.
common_country_list_rq2.sort()
common_country_dict_rq2 = {k: v for k, v in enumerate(common_country_list_rq2)}
common_country_dict_rq2
{0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'Angola',
4: 'Argentina',
5: 'Armenia',
6: 'Australia',
7: 'Austria',
8: 'Azerbaijan',
9: 'Bahamas',
10: 'Bahrain',
11: 'Bangladesh',
12: 'Barbados',
13: 'Belarus',
14: 'Belgium',
15: 'Belize',
16: 'Benin',
17: 'Bhutan',
18: 'Bolivia',
19: 'Bosnia and Herzegovina',
20: 'Botswana',
21: 'Brazil',
22: 'Brunei',
23: 'Bulgaria',
24: 'Burkina Faso',
25: 'Burundi',
26: 'Cambodia',
27: 'Cameroon',
28: 'Canada',
29: 'Cape Verde',
30: 'Central African Republic',
31: 'Chad',
32: 'Chile',
33: 'China',
34: 'Colombia',
35: 'Comoros',
36: 'Congo, Dem. Rep.',
37: 'Congo, Rep.',
38: 'Costa Rica',
39: "Cote d'Ivoire",
40: 'Croatia',
41: 'Cuba',
42: 'Cyprus',
43: 'Czech Republic',
44: 'Denmark',
45: 'Djibouti',
46: 'Dominican Republic',
47: 'Ecuador',
48: 'Egypt',
49: 'El Salvador',
50: 'Equatorial Guinea',
51: 'Eritrea',
52: 'Estonia',
53: 'Eswatini',
54: 'Ethiopia',
55: 'Fiji',
56: 'Finland',
57: 'France',
58: 'Gabon',
59: 'Gambia',
60: 'Georgia',
61: 'Germany',
62: 'Ghana',
63: 'Greece',
64: 'Guatemala',
65: 'Guinea',
66: 'Guinea-Bissau',
67: 'Guyana',
68: 'Haiti',
69: 'Honduras',
70: 'Hungary',
71: 'Iceland',
72: 'India',
73: 'Indonesia',
74: 'Iran',
75: 'Iraq',
76: 'Ireland',
77: 'Israel',
78: 'Italy',
79: 'Jamaica',
80: 'Japan',
81: 'Jordan',
82: 'Kazakhstan',
83: 'Kenya',
84: 'Kuwait',
85: 'Kyrgyz Republic',
86: 'Lao',
87: 'Latvia',
88: 'Lebanon',
89: 'Lesotho',
90: 'Liberia',
91: 'Libya',
92: 'Lithuania',
93: 'Luxembourg',
94: 'Madagascar',
95: 'Malawi',
96: 'Malaysia',
97: 'Maldives',
98: 'Mali',
99: 'Malta',
100: 'Mauritania',
101: 'Mauritius',
102: 'Mexico',
103: 'Moldova',
104: 'Mongolia',
105: 'Montenegro',
106: 'Morocco',
107: 'Mozambique',
108: 'Myanmar',
109: 'Namibia',
110: 'Nepal',
111: 'Netherlands',
112: 'New Zealand',
113: 'Nicaragua',
114: 'Niger',
115: 'Nigeria',
116: 'North Korea',
117: 'North Macedonia',
118: 'Norway',
119: 'Oman',
120: 'Pakistan',
121: 'Palestine',
122: 'Panama',
123: 'Papua New Guinea',
124: 'Paraguay',
125: 'Peru',
126: 'Philippines',
127: 'Poland',
128: 'Portugal',
129: 'Qatar',
130: 'Romania',
131: 'Russia',
132: 'Rwanda',
133: 'Samoa',
134: 'Sao Tome and Principe',
135: 'Saudi Arabia',
136: 'Senegal',
137: 'Serbia',
138: 'Sierra Leone',
139: 'Singapore',
140: 'Slovak Republic',
141: 'Slovenia',
142: 'Solomon Islands',
143: 'Somalia',
144: 'South Africa',
145: 'South Korea',
146: 'South Sudan',
147: 'Spain',
148: 'Sri Lanka',
149: 'St. Lucia',
150: 'St. Vincent and the Grenadines',
151: 'Sudan',
152: 'Suriname',
153: 'Sweden',
154: 'Switzerland',
155: 'Syria',
156: 'Taiwan',
157: 'Tajikistan',
158: 'Tanzania',
159: 'Thailand',
160: 'Timor-Leste',
161: 'Togo',
162: 'Tonga',
163: 'Trinidad and Tobago',
164: 'Tunisia',
165: 'Turkey',
166: 'Turkmenistan',
167: 'Uganda',
168: 'Ukraine',
169: 'United Arab Emirates',
170: 'United Kingdom',
171: 'United States',
172: 'Uruguay',
173: 'Uzbekistan',
174: 'Vanuatu',
175: 'Venezuela',
176: 'Vietnam',
177: 'Yemen',
178: 'Zambia',
179: 'Zimbabwe'}
emp_gend_ratio_yrs_country_corr = df_emp_rq2.corrwith(df_gend_ratio_pr_sec_ter_yrs_rq2, axis = 1).to_frame().reset_index().rename(columns = {'index': 'country', 0: 'correlationValue'})
emp_gend_ratio_yrs_country_corr.replace(common_country_dict_rq2, inplace = True)
emp_gend_ratio_yrs_country_corr
| country | correlationValue | |
|---|---|---|
| 0 | Afghanistan | 0.323407 |
| 1 | Albania | -0.895268 |
| 2 | Algeria | 0.895060 |
| 3 | Angola | -0.809776 |
| 4 | Argentina | 0.243432 |
| ... | ... | ... |
| 175 | Venezuela | 0.345778 |
| 176 | Vietnam | 0.152598 |
| 177 | Yemen | -0.934838 |
| 178 | Zambia | 0.602975 |
| 179 | Zimbabwe | 0.385582 |
180 rows × 2 columns
Let's see the most positively correlated countries to most negatively correlated countries. We can further analyse what happened in those countries to understand the trend better
high_neg_corr = emp_gend_ratio_yrs_country_corr.sort_values("correlationValue").head()
high_pos_corr = emp_gend_ratio_yrs_country_corr.sort_values("correlationValue").tail()
display(high_neg_corr)
display(high_pos_corr)
| country | correlationValue | |
|---|---|---|
| 143 | Somalia | -0.993291 |
| 24 | Burkina Faso | -0.985139 |
| 33 | China | -0.984672 |
| 108 | Myanmar | -0.981745 |
| 100 | Mauritania | -0.979770 |
| country | correlationValue | |
|---|---|---|
| 150 | St. Vincent and the Grenadines | 0.954454 |
| 88 | Lebanon | 0.955298 |
| 41 | Cuba | 0.959584 |
| 125 | Peru | 0.972933 |
| 91 | Libya | 0.987849 |
clrred = 'rgb(222, 0, 0)'
clrgrn = 'rgb(0, 222, 0)'
clrs = [clrred if c in list(high_neg_corr.country) else clrgrn if c in list(high_pos_corr.country) else px.colors.qualitative.Pastel[1] for c in emp_gend_ratio_yrs_country_corr.country]
fig = go.Figure(data=[go.Bar( x= emp_gend_ratio_yrs_country_corr["country"], y = emp_gend_ratio_yrs_country_corr['correlationValue'],
marker_color = clrs
)])
fig.update_layout(xaxis_tickangle=-45, title='Correlation of Employement Rate and Gender Ratio of School Years', xaxis_title="Country",
yaxis_title="Correlation")
fig.show()
The green and red bars denote the highest positive correlation and lowest negative correlations among the 180 countries.
On whole, the education state of the world is improving and the related employability situation as well.